Hybrid time series models with exogenous variable for improved yield forecasting of major Rabi crops in India

Accurate and in-time prediction of crop yield plays a crucial role in the planning, management, and decision-making processes within the agricultural sector. In this investigation, utilizing area under irrigation (%) as an exogenous variable, we have made an exertion to assess the suitability of different hybrid models such as ARIMAX (Autoregressive Integrated Moving Average with eXogenous Regressor)–TDNN (Time-Delay Neural Network), ARIMAX–NLSVR (Non-Linear Support Vector Regression), ARIMAX–WNN (Wavelet Neural Network), ARIMAX–CNN (Convolutional Neural Network), ARIMAX–RNN (Recurrent Neural Network) and ARIMAX–LSTM (Long Short Term Memory) as compared to their individual counterparts for yield forecasting of major Rabi crops in India. The accuracy of the ARIMA model has also been considered as a benchmark. Empirical outcomes reveal that the ARIMAX–LSTM hybrid modeling combination outperforms all other time series models in terms of root mean square error (RMSE) and mean absolute percentage error (MAPE) values. For these models, an average improvement of RMSE and MAPE values has been observed to be 10.41% and 12.28%, respectively over all other competing models and 15.83% and 18.42%, respectively over the benchmark ARIMA model. The incorporation of the area under irrigation (%) as an exogenous variable in the ARIMAX framework and the inbuilt capability of the LSTM model to process complex non-linear patterns have been observed to significantly enhance the accuracy of forecasting. The performance supremacy of other hybrid models over their individual counterparts has also been evident. The results also suggest avoiding any performance generalization of individual models for their hybrid structures.


Autoregressive integrated moving average (ARIMA) model
The most commonly used models to represent linear dynamics in time series literature are the ARIMA models 41 .We state that a univariate process {y t } conforms to the ARIMA (p, d, q) model if it can be expressed as follows: where p, d, and q represent the orders of autoregression, differencing, and moving average, respectively.
(1) φ p (B)∇ d y t = θ q (B)ε t {ε t } is hypothesised to adhere to a standard white noise process, exhibiting a normal distribution with zero mean and a variance of σ 2 .In case y t does not undergo mean-adjustment, there is provision to append a constant term, denoted by μ, to the right side of Eq. (1).The ARIMA methodology can be segmented into three essential stages: identification, estimation, and diagnostic checking.In the identification stage, the parameters for the ARIMA model are provisionally chosen.These tentatively selected parameters are then quantified in the estimation stage.During the subsequent diagnostic checking stage, the model adequacy is evaluated thoroughly.If the model is deemed unsuitable, the entire three-step process resumes and continues until an apt ARIMA model for the given time series is obtained.

Autoregressive integrated moving average with exogenous regressor (ARIMAX) model
The ARIMAX model is a more sophisticated version of the ARIMA model.It has the ability to incorporate an external input variable 42 .It operates by assuming a form predicated on a given historical input vector x t : where φ p (B) and θ q (B) assume the predefined form of the ARIMA model and ν(B)x mt is defined as: In the model, m denotes the number of exogenous input variables.Additionally, it is presumed that {ε t } fol- lows a white noise process with N(0, σ 2 ).

Time-delay neural network (TDNN) model
Artificial neural networks, modelled after the human brain, are composed of mathematical functions known as artificial neurons (nodes).These neurons are grouped together to form a layer of processing elements.Typically, neural networks are structured with three layers: the input, hidden, and output layers.
An approach to forecast time series with neural networks involves incorporating dynamic characteristics into a static structure, like a multilayer perceptron.This approach offers an implicit functional depiction of time, which is useful in portraying the behavior of data that evolves over time 43 .A potential approach to incorporate short-term memory is using time delay as input 44 .TDNN exemplifies such a structure.The general expression for the final output y t of a TDNN model is expressed as follows: where α j and β ij are the model hyper-parameters.p and q denote the number of input and hidden nodes, respec- tively.The hidden layer activation function (g) has taken the form of Rectified Linear Unit (ReLU).

Non-linear support vector regression (NLSVR) model
NLSVR primarily converts the initial input space into a feature space with higher dimensions.Subsequently, a linear regression model is built within it, effectively representing non-linear regression in the original space 21,45 .In the context of a dataset represented by Z = {x i y i } N i=1 , where the input vector x i belongs to the n-dimensional real space, y i represents the scalar output, and N denotes the size of Z, the general equation of NLSVR can be expressed as follows: where w represents the weight vector, φ(x) stands for the non-linear mapping function, b signifies the bias term, and superscript T indicates the transpose operation.The data is used to estimate the coefficients w and b through the minimization of a regularized risk function: Equation (10) comprises two elements: the first is a regularized component represented as half times the norm of w squared, while the second component is referred to as the empirical error, denoted as The regularized risk function effectively balances the optimization of both of these components simultaneously, thus preventing both underfitting and overfitting issues of the model.
(2)  46 .This concept leads to the formulation of the WNN.In a WNN, the activation functions are derived from an orthonormal wavelet basis.The term 'wavelon' is used to refer to the neurons in this context.The output of a wavelon with a single input is defined as follows: Morlet function 47 has been proposed for WNN in this study, which is expressed as follows: This wavelet is derived from a function that bears proportionality to both the cosine function, and normal probability density function.

Convolutional neural network (CNN) model
CNN is a network model introduced by Lecun et al. 48, which has a neural connectivity pattern similar to the visual cortex of animals.CNN has found extensive usage in the realm of image and natural language processing 49 .Nonetheless, it can be effectively employed for time series forecasting.One of the key advantages of CNNs lies in their ability to perceive data locally and share weights, which can greatly lessen the number of parameters and thereby improve learning efficiency.CNN consists of two structural layers: the convolutional layer and the pooling layer 50 .Within the convolution layer, there are several convolutional kernels.The process of convolution can be represented as follows: where tanh is the activation function.l t , x t , k t, and b t represent the output value after convolution, the input vector, the weight of the convolution kernel, and the bias of the convolution kernel, respectively.Following the convolution operation, the main characteristics of the data are retrieved, which is marked by an expansion in the feature dimensions.To address this challenge as well as to lessen the load during training, a pooling layer is introduced before providing the final output 51 .

Recurrent neural network (RNN) model
RNNs pose greater technical complexity compared to feedforward networks, necessitating a solid grasp of dynamic recurrence mechanisms.A basic RNN can be thought of as a single-layer RNN where the activation is delayed and simultaneously looped back with the external input (or the output from a preceding layer).The conventional RNN can be described mathematically as 52 : where t (0, 1, …, N) represents a discrete time point, N signifies the final time in a finite time period, s t refers to a vector with m dimensions representing external inputs at time t, and h t represents the n-dimensional output activation via σ t .This σ t may vary over time and can exhibit non-linear behavior.The non-indexed parameters, to be set via training, are the n × n matrix U, the n × m matrix W, and the n × 1 vector b.

Long short term memory (LSTM) model
Hochreiter and Schmidhuber 53 proposed the LSTM neural network in 1997 to deal with long-term data dependencies.The design of LSTM allows it to recall information for an extended length of time while resolving the problem of vanishing gradient 54,55 , which is the main lacuna of the RNN model 56 .The LSTM model consists of three memory modules, namely forget gate (f t ) , input gate (i t ) , and output gate (o t ) .The primary roles of these three gates are to retain critical information while eliminating unnecessary information from the cell state.The pivotal element in LSTM is the cell state (C t ) , which operates concurrently throughout the entire recurrent chain with a few minor interactions.The structural representation of the LSTM is graphically provided in Fig. 1.
To commence the information processing with the LSTM model, the first step involves removing extraneous information from C t .The forget gate offers this by employing a sigmoid function.
where W f and b f are the weight and bias of the forget gate, respectively, y t is the input value of the current time and h t−1 is the output value of the prior unit.
The cell state is then updated with new or pertinent information.This is accomplished by the use of another gate with a sigmoid function, known as the input gate, and a tanh layer.
where W i , b i, and W c , b c are the weight and bias of the input gate and the candidate input, respectively.In the next step, the current cell state is updated as follows: Then, the output gate takes h t−1 and y t as input values, and its output is calculated using the formula: where W o and b o are the weight and bias of the output gate, respectively.Finally, the output of the LSTM model is computed as follows: Hybrid models Hybrid time series models leverage various modeling approaches to enhance the accuracy and robustness of forecasts.These models consider the time series y t as a blend of both linear and non-linear elements.
where L t and N t represent the linear and non-linear components, respectively.The operational premise of the hybridization approach 57 begins with fitting a linear model to the data and obtaining the corresponding forecast ( L t ) .The subsequent stage entails acquiring the residuals ( e t ) of the linear model and checking for the existence of non-linear patterns in its structure.
Once the residuals validate the presence of non-linearity, they are subsequently fed into an appropriate nonlinear model.After obtaining forecasts ( N t ) for the non-linear component, these are combined with the linear forecasts to generate the aggregate forecasts.
In this investigation, we have used the ARIMAX model to capture the linear component, whereby the residuals are modelled separately by the TDNN, NLSVR, WNN, CNN, RNN, and LSTM models to examine the performance of different combinations.Figure 2 provides the schematic representation of the hybridization technique.

Assessment of forecasting accuracy
To assess the forecasting accuracy of the time series models under investigation, two performance measures, namely the root mean square error (RMSE) and the mean absolute percentage error (MAPE), are employed.The model exhibiting the lowest RMSE and MAPE values is deemed to be the most optimal.
(18) where y t and y t denote the actual and predicted values of the t th observation of the test data, respectively and n is the size of the test data set.

Summary statistics
Table 1 displays summary statistics of the data series utilised in the investigation.Sugarcane has exhibited the highest mean irrigated area (%), with wheat and groundnut trailing behind.A high coefficient of variation value signifies considerable volatility in these series.

Results of the augmented Dickey-Fuller (ADF) test
The ADF test 58,59 has been utilised to ascertain the order of differencing and the results are presented in Table 2.
All the yield series have exhibited non-stationary behavior at level series and stationary behavior at the first difference series.

Fitting of the ARIMA models
In  www.nature.com/scientificreports/minimum values for Akaike information criteria (AIC) and Bayesian information criteria (BIC), as well as the RMSE and MAPE.The parameter estimates of the chosen ARIMA models, along with their significance levels, are provided in Table 3.

Fitting of the ARIMAX models
The selection of a suitable exogenous variable is crucial for the ARIMAX model-building procedure.The significance of the correlation co-efficient between yield and area under irrigation (%) in each case, as reported in

Results of the Broock-Dechert-Scheinkman (BDS) test
Before proceeding to non-linear or hybrid modeling strategies, it is required to examine the series for the presence of non-linear features.To assess non-linearity, the BDS test 60 has been implemented.Table 6 provides the outcomes of the BDS test.For all three cases, a strong rejection of linearity has been observed.It implies that the non-linear models can effectively be implemented in these series.

Fitting of the TDNN models
For this study, we have found the optimal time-delay neural network with a single hidden layer.Experimentation has been carried out to identify the number of tapped delays and hidden nodes.We have altered the range of input and hidden nodes from 1 to 6 and from 1 to 10, respectively for both the original and ARIMAX residual series.The training of networks has been accomplished by utilizing the Levenberg-Marquardt back-propagation algorithm.Table 7 contains the specifications of the chosen TDNN models.

Fitting of the NLSVR models
An essential step in NLSVR modeling is the selection of optimal hyper-parameters.The input lags, kernel function, regularization parameter, kernel width, and margin of tolerance significantly influence the NLSVR performance.In this study, we have employed the widely adopted radial basis function (RBF) as the kernel function.We have constructed NLSVR models for both the original and residual series based on the specifications outlined in Table 8.

Fitting of the WNN models
For each crop, wavelet neural network models have been chosen based on their forecasting performance at various numbers of input lags (from 1 to 6) and hidden nodes (from 1 to 10).Table 9 presents the specifications of the best performing WNN models for the original and residual series, respectively.Similar to TDNN, the Levenberg-Marquardt back-propagation algorithm has been used for training purposes.

Fitting of the CNN models
The performance of the CNN model crucially relies on the optimal selection of hyper-parameters.The set of hyper-parameters tuned for the training of the CNN model consists of the number of input nodes, number of filters and kernel size at the convolution layer, and the pool size at the pooling layer.Based on previous literature, we have used ReLU as an activation function.The best-performing CNN models have been constructed based on the specifications detailed in Table 10.

Fitting of the RNN models
RNNs perform better in predicting sequential, non-linear behavior of the series.Each series has already been duly assessed for the existence of non-linearity.Subsequently, by altering the range of nodes from 1 to 6 at the input layer and 1 to 10 at the hidden layer, the best configuration of hyper-parameters has been acquired.Specifications of the selected RNN models for both original and residual series are provided in Table 11.

Fitting of the LSTM models
The structure of the LSTM models studied in this work comprises an input layer, a hidden layer with LSTM cells as hidden nodes, and an output layer with a single output node.Because of its flexibility to apply multiple learning rates for different parameters, adaptive moment estimation (Adam), a prominent variant of the stochastic gradient descent (SGD) method, has been used for loss function optimization 61 .To obtain the optimal set, multiple configurations of the LSTM model have been explored by varying the different hyper-parameters.
The LSTM tuning involves considering hyper-parameters such as the number of input and hidden nodes, batch size, and the number of epochs.These hyper-parameters serve not only to govern the model's architecture and topology, but also to optimize key parameters like biases and weights.Several automated approaches are available  in the literature that can potentially be used for hyper-parameter tuning.Out of these techniques, we have used the grid search method, which examines all the possible combinations of hyper-parameters.After trying various combinations, substantial effects have been observed for altering the number of input and hidden nodes, whereas the number of epochs and batch size have shown better results when set to 300 and 1, respectively.The trade-off between the number of (trainable) parameters and the error metrics is also considered for giving due weightage to parsimony.The outcomes of the eventual optimised configuration of the input and hidden nodes are depicted in Table 12.

Discussion
The comparative assessment of out-of-sample accuracy for different time series models under consideration is given in Table 13.The ARIMAX-LSTM hybrid models appear to have demonstrated superior performance in forecasting yield series, faring better than all the other models.This implies that the forecasted series obtained through the ARIMAX-LSTM framework tends to align more closely with the actual yield series values.The plots of the original series and predicted series by the best performing ARIMAX-LSTM model are shown in Figs. 3, 4, and 5, respectively.The plots clearly show that the ARIMAX-LSTM hybrid models have effectively captured the trends and trajectories of yield movements.Yield forecasts for the next five years (2021-2025) by these models have also been provided in Table 14. Figure 6 displays a radar plot depicting the error metrics (RMSE and MAPE values) for various models under study.
It is also noteworthy to mention that despite the presumption of linearity, meticulous selection of area under irrigation (%) as an auxiliary variable has mediated the outperformance of the ARIMAX model over the univariate non-linear models.As we confront evolving climatic outlooks, the importance of irrigation will become even more pronounced.By reducing reliance on precipitation and maintaining a balanced level of moisture in the soil, irrigated farming can show more resilience to shifts in weather patterns.However, the supremacy of the ARIMAX-LSTM models over the other ARIMAX-based hybrid models is due to the inbuilt ability of the LSTM model to process any sequential data effectively.Its unique structural make up has helped to provide a more comprehensive view of the complex time series data, encompassing both short-term and long-term insights.The competitive advantages of using LSTM in capturing residual patterns were also evidenced in the studies of Manowska et al. 62 for natural gas consumption forecasting, Wu et al. 63 for precipitation amount and drought forecasting, Dave et al. 64 for export forecasting, Khozani et al. 65 for groundwater level forecasting, etc.
Outcomes emanated from this investigation also suggest the superior performance of all the hybrid models over their individual counterparts.As real-world time series data are subjected to shifts, abrupt changes, and evolving patterns, different forecasting techniques excel in their respective domains 66,67 .Hybrid models can leverage the strengths of various methods to adapt to changing conditions, making them more robust in scenarios where the underlying data-generating process may vary over time.This adaptability is especially valuable for dealing with data from domains where external factors can significantly influence the time series, such as in this case, yield forecasting of Rabi crops 68,69 .In addition, it has been noticed that the ranking of performance among the individual models is distinct from their hybrid counterparts.To illustrate, the performance hierarchy of the non-linear models is as follows: LST M > RNN > WNN > TDNN > CNN > NLSVR.Conversely, within the hybrid framework, a different performance hierarchy has been observed: LSTM > WNN > CNN > RNN > TDNN > NLSVR.It clearly indicates that the comparative performance of individual models cannot be generalized for the comparison of their hybrid structures, emphasizing the importance of data-driven forecasting exercises.

Conclusions
In this study, we have employed different ARIMAX-based hybrid models and compared their performances with their individual counterparts as well as with the ARIMA model for yield forecasting of major Rabi crops in India.It has been observed that the ARIMAX-LSTM modeling combination has provided better forecasts than other time series models, as evidenced by various accuracy measures.For these models, an average improvement of RMSE and MAPE values has been observed to be 10.41% and 12.28%, respectively over all other competing models and 15.83% and 18.42%, respectively over the benchmark ARIMA model.It can also be inferred that the  inclusion of area under irrigation (%) as an exogenous variable in the ARIMAX framework and the inbuilt ability of the LSTM model to process complex non-linear patterns have greatly improved the forecasting accuracy.The performance supremacy of other hybrid models over their individual counterparts has also been evident.It is also suggested to avoid any performance generalization of individual models for their hybrid structures.Future works are expected to explore the performance of other hybrid structures such as ARIMA-NARX, ARIMA-NLSVRX, ARIMA-LSTMX, etc. for agricultural yield forecasting. https://doi.org/10.1038/s41598-023-49544-wwww.nature.com/scientificreports/

Figure 1 .
Figure 1.Structural representation of Long Short Term Memory.

Figure 2 .
Figure 2. Schematic diagram of hybrid modelling strategy.

Figure 3 .
Figure 3. Original series and best-predicted series (by the ARIMAX-LSTM model) for wheat yield in India.

Figure 4 .
Figure 4. Original series and best-predicted series (by the ARIMAX-LSTM model) for sugarcane yield in India.

Table 2 .
Results of the ADF test for testing stationarity.

Table 3 .
Parameter estimates of the fitted ARIMA models.

Table 4 .
Correlation coefficients between yield and area under irrigation (%).

Table 5 .
Parameter estimates of the fitted ARIMAX models.

Table 4 ,
indicates a possible outperformance of the ARIMAX models over the traditional ARIMA models.Following the ARIMA model-building process, the optimal model has been chosen based on the lowest AIC, BIC, RMSE, and MAPE values.Table5displays the specifications of the selected ARIMAX models.

Table 6 .
Results of the BDS test for testing non-linearity.

Table 7 .
Optimal hyper-parameter values of the fitted TDNN models.

Table 8 .
Optimal hyper-parameter values of the fitted NLSVR models.

Table 9 .
Optimal hyper-parameter values of the fitted WNN models.

Table 10 .
Optimal hyper-parameter values of the fitted CNN models.

Table 11 .
Optimal hyper-parameter values of the fitted RNN models.

Table 12 .
Optimal hyper-parameter values of the fitted LSTM models.

Table 13 .
Forecasting accuracies of the time series models under study.